Transforming the Web into Data (with Python)

Agenda

  • conceptual introduction to web scraping
  • tools for non-programmers
  • tools for python programmers
  • code tour
  • break
  • scrape from scratch exercise

why scrape the web

  • there is a lot of human activity on the web, which produces
  • new and unique data/traces, that can lead to
  • insight & understanding for data science, the social sciences, and the humanities.
  • Ground Truthiness - remember the web is only a particular representation of human behavior
  • You can also scrape for fun & profit 💰

so what is scraping the web?

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. - Wikipedia

conceptual introduction to web scraping

  • there are roughly three steps - results may vary (a quick sketch follows below)
    1. fetching resources - asking a computer "hey, can you send me http://google.com?"
    2. parsing documents - creating a machine readable representation of a web page
    3. extracting data - pulling out just the information of interest
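
Here is a minimal sketch of all three steps in python. It assumes the Beautiful Soup library (which shows up again in the tools section) plus the requests library, which is one common choice for fetching; the URL is just an example.

import requests
from bs4 import BeautifulSoup

# 1. fetching: ask a server to send us a page
response = requests.get("http://pitt.edu")

# 2. parsing: turn the raw HTML text into a tree of elements
soup = BeautifulSoup(response.text, "html.parser")

# 3. extracting: pull out just the information of interest
print(soup.title.string)   # e.g. "Home | University of Pittsburgh"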

fetching resources

  • Hyper Text Transfer Protocol (HTTP)
    • fundamentally about requests & responses
    • the language of the web
    • the main request methods: GET, POST, PUT, DELETE (plus a few others)
    • URLs point to resources
  • verbs & nouns
    • request methods are the verbs
    • resources are the nouns
    • URLs are the proper nouns
  • stateless
    • doesn't have a good memory
    • sessions - how HTTP servers remember "state"
    • cookies - the token passed in HTTP requests & responses
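
A quick request & response sketch, again using the requests library as one possible HTTP client and http://pitt.edu as a stand-in URL:

import requests

# GET is the verb, http://pitt.edu is the proper noun
response = requests.get("http://pitt.edu")

print(response.status_code)               # 200 means "OK"
print(response.headers["Content-Type"])   # what kind of resource came back
print(response.cookies)                   # any cookies the server handed us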

fetching resources

  • web pages - made for humans
  • APIs - made for machines
    • Application Programming Interface - fancy name for how computers talk to each other
    • how to get data from the social web (i.e. Twitter, Facebook, etc.)
    • related, but distinct from web scraping (more structured, access control)
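
For comparison, here is a sketch of asking an API for data instead of scraping a page. GitHub's public API is used only because it needs no login; the field names are whatever that particular API returns.

import requests

# APIs hand back structured data (usually JSON) instead of HTML
response = requests.get("https://api.github.com/users/octocat")
profile = response.json()   # parse the JSON body into a python dict

print(profile["login"], profile["public_repos"])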

parsing documents

  • HTML documents are composed of elements or tags
    • the <html> tag is the root of the tree
  • the HTML specification defines a bunch of tags
    • <p>this is a paragraph tag with text <em>inside</em> of it</p>
    • <a href="http://pitt.edu">This is an anchor tag, basically a link</a>
    • not enough time to review all of them
  • parsing transforms the barf into a tree of elements
<!DOCTYPE html>
<html>
  <head>
    <title>A basic webpage</title>
  </head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <ul>
      <li>First item in an unordered list</li>
      <li>Second item in an unordered list</li>
    </ul>
    <div class="stuff">
      <p>Another paragraph separated by a div element.</p>
    </div>
  </body>
</html>
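
A sketch of parsing that snippet with Beautiful Soup (covered in the tools section) and poking around the resulting tree:

from bs4 import BeautifulSoup

html = """
<html>
  <head><title>A basic webpage</title></head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <ul>
      <li>First item in an unordered list</li>
      <li>Second item in an unordered list</li>
    </ul>
    <div class="stuff">
      <p>Another paragraph separated by a div element.</p>
    </div>
  </body>
</html>
"""

# parse the raw text into a tree of elements
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)                          # A basic webpage
print([li.text for li in soup.find_all("li")])    # both list items
print(soup.find("div", class_="stuff").p.text)    # the paragraph inside the div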

In [8]:
! curl -s http://pitt.edu | head -n30


<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html  lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#"><!--<![endif]-->

<head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta charset="utf-8" />
<link rel="shortcut icon" href="http://www.pitt.edu/sites/default/files/pitt_favicon_0.ico" type="image/vnd.microsoft.icon" />
<link rel="shortlink" href="/node/62" />
<link rel="canonical" href="/home" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
  <title>Home | University of Pittsburgh</title>
  <meta name="description" content="The University of Pittsburgh is among the nation's most distinguished comprehensive universities, with a wide variety of high-quality programs in both the arts and sciences and professional fields." />
  <meta name="Keywords" content="University, Pittsburgh, Pitt, College, Learning, Research, Students, Undergraduate, Graduate" />
    
      <meta name="MobileOptimized" content="width">
    <meta name="HandheldFriendly" content="true">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="cleartype" content="on">
	
  <link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_kShW4RPmRstZ3SpIC-ZvVGNFVAi0WEMuCnI0ZkYIaFw.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_vZ_wrMQ9Og-YPPxa1q4us3N7DsZMJa-14jShHgRoRNo.css" media="screen" />
<link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_4mmZo2I5oU53mjQh0UjgygKazedTCqZXNvrxFyYrT-g.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_hME6weH8liYUm6qr-IDiSXVwXgjKndoDaEQ2Jq3-W10.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://www.pitt.edu/sites/default/files/css/css_LwvUaww9zeUDxZ1r2K4dHcSbAEEzbSNA-5Zz2KIgwD4.css" media="all" />
   
  <script src="http://www.pitt.edu/sites/all/modules/jquery_update/replace/jquery/1.5/jquery.js?v=1.5.2"></script>
<script src="http://www.pitt.edu/misc/jquery.once.js?v=1.2"></script>

extracting data

  • ok, now we are going to get really technical
  • pull information out of the tree and push it somewhere else
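
For example, a sketch that pulls every link out of a page's tree and pushes it somewhere else, here a CSV file; the URL and filename are just placeholders.

import csv

import requests
from bs4 import BeautifulSoup

# fetch & parse a page, then push its links into a CSV file
soup = BeautifulSoup(requests.get("http://pitt.edu").text, "html.parser")

with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    for a in soup.find_all("a", href=True):
        writer.writerow([a.get_text(strip=True), a["href"]])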

extracting data

  • how?
    • copy & paste
    • automated scripts
  • if you have a lot of data, copy & paste probably won't work for you
  • if the data are on multiple pages, you will need to crawl with a spider
  • web crawlers extract the links from a web page, fetch those pages, extract links, fetch, extract, fetch...
  • scripts and tools help automate this process

extracting data

  • first step: where are the data in the HTML tree?
  • right-click & select "Inspect Element" - works in Firefox & Chrome, and in Safari if the Develop menu is enabled
    • THE MATRIX

extracting data

  • selection is the key
  • many different ways to select HTML tags
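
A few of those ways, sketched with Beautiful Soup on a tiny made-up snippet:

from bs4 import BeautifulSoup

html = '<div class="stuff"><p id="intro">Hello</p><a href="http://pitt.edu">Pitt</a></div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p").text)                       # select by tag name
print(soup.find(id="intro").text)                # select by attribute
print(soup.select_one("div.stuff a")["href"])    # select with a CSS selector
print([a["href"] for a in soup.find_all("a")])   # grab every link on the page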

basic workflow

  1. fetch pages
  2. extract data
  3. extract links
  4. fetch more pages
  5. ...
  6. profit?
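
A toy spider that follows this workflow. It is only a sketch: it visits a handful of pages, sleeps between requests, and the start URL is just an example.

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=5):
    """Toy spider: fetch a page, extract data, extract links, fetch more pages..."""
    to_visit, seen, titles = [start_url], set(), []
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)

        # fetch the page and extract data (here, just the <title>)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        if soup.title:
            titles.append(soup.title.string)

        # extract links and queue them up so we can fetch more pages
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http"):
                to_visit.append(link)

        time.sleep(1)   # be polite to the server
    return titles

print(crawl("http://pitt.edu"))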

DATA CLEANING!

challenges in web scraping

  • logins, paywalls, and access control
    • these are not impossible - most scraping tools support HTTP sessions & cookies
    • throttling - there's a fine line between scraping & a denial-of-service attack
    • THE LAW - read the terms of service, copyright? FAIR USE!
  • dynamic websites
    • JavaScript - hard to scrape because the DOM changes after the page loads
    • AJAX or XMLHttpRequest - pages can asynchronously fetch data & update themselves
  • the document vs. application centric web
    • scraping Gmail?
    • APIs help, if they exist
  • mobile web / apps????
    • ¯\_(ツ)_/¯
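
A sketch of dealing with logins & throttling using a requests Session. The login URL and form field names here are completely made up - they depend entirely on the site you are scraping.

import time
import requests

# a Session keeps cookies between requests - that's how you stay "logged in"
session = requests.Session()

# hypothetical login form: the URL & field names depend entirely on the site
session.post("https://example.com/login",
             data={"username": "me", "password": "secret"})

# the session now sends the login cookie along automatically
for page in ["https://example.com/private/1", "https://example.com/private/2"]:
    response = session.get(page)
    print(response.status_code)
    time.sleep(2)   # throttle yourself - stay on the right side of the scraping/DoS line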

tools for non-programmers

  • Import.io - a commercial product, not sure how expensive
  • Wget - the Swiss army knife of web scraping tools, command line
  • HTTrack - a Windows tool for copying websites, GUI
  • ScraperWiki - a service that costs money, if you have a grant...
  • Scraper Plugin - a Chrome plugin instead of a service, looks pretty easy to use
  • Diffbot - more advanced extraction, really nice guys, costs money but they support research if you ask them
  • EMAIL - sometimes it doesn't hurt to ask!

tools for python programmers

fetching

parsing & extracting

  • Beautiful Soup - the most popular library
  • Soupy - a wrapper around BS to make life easier
  • Scrapely - another tool for extracting structured data from web pages
  • lxml - a bit lower level, supports XPath, which I prefer (quick sketch below)
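
For example, a quick XPath sketch with lxml, run on the div from the earlier snippet:

from lxml import html

page = html.fromstring("""
<div class="stuff">
  <p>Another paragraph separated by a div element.</p>
  <a href="http://pitt.edu">Pitt</a>
</div>
""")

# XPath expressions pick nodes out by their place & attributes in the tree
print(page.xpath('//div[@class="stuff"]/p/text()'))
print(page.xpath('//a/@href'))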

data management

Where to go next?

Learning Python

Web Scraping

  • The Ultimate Guide to Web Scraping - A short book that provides a conceptual introduction to web scraping
  • Mining the Social Web, 2nd Edition - An excellent book for more advanced programmers who are interested in collecting and analyzing data from social websites like Twitter, Facebook, and Github.
  • Web Scraping with Python - A new book coming summer of 2015 that appears to cover the more technical aspects of scraping the web with python.
  • Google - Again, seriously, there are a million tutorials on the web. Some are more technical than others.